Outlier Dimensions that Disrupt Transformers are Driven by Frequency
While Transformer-based language models are generally very robust to pruning, there is the recently discovered outlier phenomenon: disabling only 48 out of 110M parameters in BERT-base drops its performance by nearly 30% on MNLI. We replicate the original evidence for the outlier phenomenon and link it to the geometry of the embedding space. We find that in both BERT and RoBERTa the magnitude of hidden state coefficients corresponding to outlier dimensions correlates with the frequencies of encoded tokens in pre-training data, and that these dimensions also contribute to the "vertical" self-attention pattern enabling the model to focus on the special tokens. This explains the drop in performance from disabling the outliers, and it suggests that to decrease anisotropicity in future models we need pre-training schemas that better take into account the skewed token distributions.
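The ablation at the heart of this result, zeroing a handful of outlier dimensions across all token representations, can be sketched as follows. This is an illustrative toy, not the paper's code: the function name and the choice of dimension index are hypothetical, and the actual outlier dimensions are identified empirically per model.

```python
def disable_outlier_dims(hidden_states, outlier_dims):
    """Zero out selected dimensions of every token's hidden state.

    hidden_states: list of per-token vectors (each a list of floats).
    outlier_dims: indices of the dimensions to disable; real outlier
    indices are model-specific (hypothetical here).
    """
    dims = set(outlier_dims)
    return [
        [0.0 if i in dims else x for i, x in enumerate(vec)]
        for vec in hidden_states
    ]

# Toy 3-token, 4-dimensional example where dimension 2 carries an
# outsized magnitude, mimicking an outlier dimension.
h = [[0.1, -0.2, 57.0, 0.3],
     [0.0, 0.4, -61.2, -0.1],
     [0.2, 0.1, 49.8, 0.0]]

ablated = disable_outlier_dims(h, [2])
```

After the call, every token's coordinate along dimension 2 is zero while all other coordinates are untouched; in a real model this tiny intervention is what produces the large accuracy drop described above.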
Myths and Legends in High-Performance Computing
In this thought-provoking article, we discuss certain myths and legends that
are folklore among members of the high-performance computing community. We
gathered these myths from conversations at conferences and meetings, product
advertisements, papers, and other communications such as tweets, blogs, and
news articles within and beyond our community. We believe they represent the
zeitgeist of the current era of massive change, driven by the end of many
scaling laws such as Dennard scaling and Moore's law. While some laws end, new
directions are emerging, such as algorithmic scaling or novel architecture
research. Nevertheless, these myths are rarely based on scientific facts, but
rather on some evidence or argumentation. In fact, we believe that this is the
very reason for the existence of many myths and why they cannot be answered
clearly. While it feels like there should be clear answers for each, some may
remain endless philosophical debates, such as whether Beethoven was better than
Mozart. We would like to see our collection of myths as a discussion of
possible new directions for research and industry investment.
At the Locus of Performance: A Case Study in Enhancing CPUs with Copious 3D-Stacked Cache
Over the last three decades, innovations in the memory subsystem were
primarily targeted at overcoming the data movement bottleneck. In this paper,
we focus on a specific market trend in memory technology: 3D-stacked memory and
caches. We investigate the impact of extending the on-chip memory capabilities
in future HPC-focused processors, particularly by 3D-stacked SRAM. First, we
propose a method oblivious to the memory subsystem to gauge the upper-bound in
performance improvements when data movement costs are eliminated. Then, using
the gem5 simulator, we model two variants of LARC, a processor fabricated in
1.5 nm and enriched with high-capacity 3D-stacked cache. With a volume of
experiments involving a broad set of proxy-applications and benchmarks, we aim
to reveal where HPC CPU performance could be circa 2028, and conclude with an
average boost of 9.77x for cache-sensitive HPC applications, on a per-chip
basis. Additionally, we exhaustively document our methodological exploration to
motivate HPC centers to drive their own technological agenda through enhanced
co-design.
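The idea of a memory-subsystem-oblivious upper bound can be illustrated with a simple Amdahl-style estimate: if data movement accounted for a fraction of runtime and were eliminated entirely, the remaining compute time caps the achievable speedup. This is a sketch of the general reasoning only, not the paper's methodology; the function name and the timing figures are illustrative.

```python
def upper_bound_speedup(total_time, data_movement_time):
    """Amdahl-style cap on speedup if all data-movement cost vanished.

    total_time: measured runtime of the application.
    data_movement_time: portion of that runtime spent moving data
    (both values are hypothetical inputs for illustration).
    """
    if not 0.0 <= data_movement_time < total_time:
        raise ValueError("data movement must be within total runtime")
    # Only the non-memory fraction of the runtime remains.
    return total_time / (total_time - data_movement_time)

# If 60% of a 10 s run is data movement, removing it caps speedup at 2.5x.
bound = upper_bound_speedup(10.0, 6.0)  # → 2.5
```

In this framing, gauging the upper bound requires only an estimate of the data-movement share of runtime, independent of any particular cache or memory configuration.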